Toward a Definitive Compressibility Measure for Repetitive Sequences
Abstract
While the $k$th order empirical entropy is an accepted measure of the compressibility of individual sequences on classical text collections, it is useful only for small values of $k$ and thus fails to capture the compressibility of repetitive sequences. In the absence of an established way of quantifying the latter, ad-hoc measures like the size $z$ of the Lempel–Ziv parse are frequently used to estimate repetitiveness. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotone upon appending symbols. Recently, a more principled measure, the size $\gamma$ of the smallest string attractor, was introduced. The measure $\gamma \le b$ lower-bounds all the previous relevant ones, while length-$n$ strings can be represented and efficiently indexed within space $O\left({\gamma \log \frac{n}{\gamma}}\right)$, which also upper-bounds many measures, including $z$. Although $\gamma$ is arguably a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotone, and it is unknown if one can represent all strings in $o(\gamma \log n)$ space. In this paper, we study an even smaller measure, $\delta \le \gamma$, which can be computed in linear time, is monotone, and allows encoding every string in $O\left({\delta \log \frac{n}{\delta}}\right)$ space because $z = O\left({\delta \log \frac{n}{\delta}}\right)$. We argue that $\delta$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $\delta$ can be strictly smaller than $\gamma$, by up to a logarithmic factor; (2) there are string families needing $\Omega\left({\delta \log \frac{n}{\delta}}\right)$ space to be encoded, so this space is optimal for every $n$ and $\delta$; (3) one can build run-length context-free grammars of size $O\left({\delta \log \frac{n}{\delta}}\right)$, whereas the smallest (non-run-length) grammar can be up to $\Theta(\log n/\log \log n)$ times larger; and (4) within $O\left({\delta \log \frac{n}{\delta}}\right)$ space, we can not only represent a string but also offer logarithmic-time access to its symbols, computation of substring fingerprints, and efficient indexed searches for pattern occurrences. We further refine the above results to account for the alphabet size $\sigma$ of the string, showing that $\Theta\left({\delta \log \frac{n\log \sigma}{\delta \log n}}\right)$ space is necessary and sufficient to represent the string and to efficiently support access, fingerprinting, and pattern matching queries.
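The abstract names the measures $\delta$ and $z$ without defining them. As an illustration only, the sketch below assumes the definition used in the body of the paper, $\delta = \max_k d_k(S)/k$, where $d_k(S)$ is the number of distinct length-$k$ substrings of $S$, and a greedy Lempel–Ziv parse in which each phrase is the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), or a single fresh symbol. The function names are hypothetical, and both routines are deliberately naive (polynomial time); this is not the linear-time algorithm the abstract refers to.

```python
from fractions import Fraction


def substring_complexity(s: str) -> Fraction:
    """Naive delta(s) = max over k of d_k(s) / k.

    d_k(s) is the number of distinct substrings of length k.
    Assumed definition (taken from the paper body, not the abstract).
    """
    n = len(s)
    best = Fraction(0)
    for k in range(1, n + 1):
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best


def lz_parse_size(s: str) -> int:
    """Naive number of phrases z in a greedy Lempel-Ziv parse.

    Each phrase is the longest prefix of s[i:] that has an occurrence
    starting before position i (the copy may overlap the phrase), or a
    single symbol if no such prefix exists.
    """
    n, i, z = len(s), 0, 0
    while i < n:
        length = 0
        # s.find(sub, 0, end) only reports occurrences lying entirely in
        # s[0:end]; with end = i + length the occurrence must start before i.
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += max(length, 1)
        z += 1
    return z


if __name__ == "__main__":
    for text in ("abababab", "aaaa", "abcd"):
        print(text, substring_complexity(text), lz_parse_size(text))
```

On small inputs such as `"abababab"`, the two values can be compared directly to check the inequality $\delta \le z$ implied by the chain $\delta \le \gamma \le b \le z$ in the abstract.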
Similar resources
Development of A Questionnaire to Measure Attitude toward Oocyte Donation
Background: To our knowledge, there is no valid and comprehensive questionnaire that considers attitude toward oocyte donation (OD). Therefore, this study aimed to design and develop a tool entitled attitude toward donation-oocyte (ATOD-O) to measure attitude toward OD. Materials and Methods: This methodological, qualitative research was undertaken on 15 infertile cases. In addition, we performe...
Junk DNA - repetitive sequences
Eukaryotic DNA, including human DNA, contains a large portion of noncoding sequences. Like coding DNA, noncoding DNA may be unique or present in multiple identical or similar copies. DNA sequences with high copy numbers are called repetitive sequences. If the copies of a sequence motif lie adjacent to each other in a block, or an array, we speak of tandem repeats, the repetitive sequences di...
A CF-Based Randomness Measure for Sequences
This note examines the question of randomness in a sequence based on the continued fraction (CF) representation of its corresponding representation as a number, or as D sequence. We propose a randomness measure that is directly equal to the number of components of the CF representation. This provides a means of quantifying the randomness of the popular PN sequences as well. A comparison is made...
A Distance Measure for Video Sequences
Video is a unique multimedia data type, in that it comes with distinguished spatio-temporal constraints. Content-based video retrieval thus requires methods for video sequence-to-sequence matching, incorporating the temporal ordering inherent in a video sequence, without losing sight of the visual nature of the information in the sequence. Such methods will require reliable measures of similari...
Aperiodicity Measure for Infinite Sequences
We introduce the notion of aperiodicity measure for infinite symbolic sequences. Informally speaking, the aperiodicity measure of a sequence is the maximum number (between 0 and 1) such that this sequence differs from each of its non-identical shifts in at least fraction of symbols being this number. We give lower and upper bounds on the aperiodicity measure of a sequence over a fixed alphabet. We ...
Journal
Journal title: IEEE Transactions on Information Theory
Year: 2023
ISSN: 0018-9448, 1557-9654
DOI: https://doi.org/10.1109/tit.2022.3224382